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Abstract 

A central task in the study of molecular sequence data from present-day species is the reconstruction 
of the ancestral relationships. The most established approach to tree reconstruction is the maxi- 
mum likelihood (ML) method. In this method, evolution is described in terms of a discrete-state 
continuous-time Markov process on a phylogenetic tree. The substitution rate matrix, that deter- 
mines the Markov process, can be estimated using the expectation maximization (EM) algorithm. 
Unfortunately, an exhaustive search for the ML phylogenetic tree is computationally prohibitive 
for large data sets. In such situations, the neighbor-joining (NJ) method is frequently used because 
of its computational speed. The NJ method reconstructs trees by clustering neighboring sequences 
recursively, based on pairwise comparisons between the sequences. The NJ method can be general- 
ized such that reconstruction is based on comparisons of subtrees rather than pairwise distances. In 
this paper, we present an algorithm for simultaneous substitution rate estimation and phylogenetic 
tree reconstruction. The algorithm iterates between the EM algorithm for estimating substitution 
rates and the generalized NJ method for tree reconstruction. Preliminary results of the approach 
are encouraging. 

1 Introduction 

Current efforts to reconstruct the tree of life for different organisms demand the inference of phy- 
logenies from thousands of DNA sequences. The first author is at a university where the tree of 
life for files is investigated ( http : // www. inhs . uiuc . edu /cee /FLYTREEYt and the second author is 
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at a university where the tree of hfe for fungi is considered (http:/ /ocid. nacse.org/research/aftol/ |. 
For such large data sets the tree space is enormous and identification of the optimal tree is a major 
challenge. 

The evolution of homologous DNA sequences can be described by continuous time Markov chains 
on a phylogenetic tree jS]. A continuous time Markov chain is characterized by a substitution rate 
matrix, and the phylogenetic tree summarizes the relationships between the species in terms of edge 
lengths (times since divergence) and common ancestors. The DNA sequences are only observed 
in the leaves, and information on the phylogenetic tree, substitution events (time and type) and 
edge lengths is missing. The transition matrix P{t) for a continuous time Markov process can be 
written as exp(Qt), where Q is a parameterized substitution rate matrix. In order to estimate the 
rate parameters and edge lengths for a fixed tree and based on the observed data, one can use 
the expectation maximization (EM) algorithm JU]. The updating step of the EM algorithm can 
be written explicitly in terms of eigenvalues and eigenvectors of Q. The continuous time Markov 
process gives rise to a distance measure between any sets of sequences. Pairwise distances can 
be used, together with the neighbor joining (NJ) algorithm, to reconstruct the phylogenetic tree 
that relates the sequences. The generalized neighbor joining (GNJ) algorithm improves the NJ 
algorithm by using distance measures based on subtrees rather than pairwise distances. 

In this paper, we describe an algorithm that simultaneously estimates the substitution rate 
matrix and reconstructs the phylogenetic tree. The algorithm iterates between the EM algorithm 
for rate matrix estimation and the GNJ algorithm for phylogenetic tree reconstruction. We are in 
the process of implementing the algorithm, and preliminary results are encouraging. 



2 The EM algorithm 

2.1 Two sequences 

Consider the somewhat unreasonable situation where we have access to the complete observation 
(the hidden model or the complete data model) of the evolution of a single site. The state of the 
process at time t is denoted x{t), and we have observed the process from time t = to time t = T. 
In this paper the size a of the state space is 4 corresponding to the four nucleotides S ={A,G,C,T}, 
but our method can also be applied to protein-coding sequences where the state space is of size 61 
because there are 61 sense codons. Continuous-time Markov processes are described in e.g. [2]. 

We now derive the likelihood for a complete observation of a continuous-time Markov process 
with substitution rate matrix Q. Suppose the evolution of a single site has been completely observed 
from time t = to time t = T and is given as in Fig. 1. We model the evolution in terms of a 
continuous-time Markov process with substitution rate matrix Q. Recall that a rate matrix has 
non-negative off-diagonal entries and each row sums to zero. Let Q{a,b) be the entry of Q in 
the ath column and the 6th row. The waiting time in a state a is exponentially distributed with 
parameter —Q{a,a) and the probability of substituting a with b is proportional to Q{a,b). Thus, 
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the likelihood for the complete observation in Fig. 1 is given by 
L{Q) 



Q(l,l) '^'"'^ Q(3,3)~^^*'^^^ Q(l,l) 

Q(l,l)(ii+{t3-t2))+Q{2,2){T-t2)+Q{3,3){t2-ti)g(-i^ 2)(5(1, 3)(5(3, 2), 



where indices 1,2,3,4 corresponds to A,G,C,T. 
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Figure 1: Complete observation of the evolution of a single site in a DNA sequence. 

More generally, if the rate matrix is parametrized by ^, Q = Qg, and with x = {x{t) : < t < 
T}, the maximum likelihood estimation problem for a complete observation is: 



maximize L{6; x) 



b) 



subject to 9 e@. (1) 



Here fabiO) and f'g^{0) are functions from M'^ ^ M+ = {x € M : x > 0} such that fab{0) = 
Qg{a, a), f^iO) = exp((56)(a, a)) for € © C M'^, T{a) is the total time spent in state a, and N{a, h) 
is the number of substitutions of a with h. Thus, the log-likelihood for a complete observation 
becomes 



log L{e- x) = Y, T{a)Qe{a, a) + N{a, b) log Qe[a, b). 

aeS aeS bj^a 



(2) 



Note that the complete log-likelihood is linear in the total time spent in a state and the number of 
substitutions between states. The complete log-likelihood function is analytically tractable. Indeed, 
as argued in JUj , in most Markov processes of sequence evolution the maximum likelihood estimates 
can be found analytically from a complete observation. 

Consider for example the general time reversible (GTR) model. Let tTq, a G S, X^a'^i ~ 
denote the stationary distribution of the Markov chain. This distribution can be estimated from 
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the nucleotide frequencies in a single sequence. The GTR model has substitution rate matrix (e.g., 

HOI) 



dl2'^2 duT^S Ow^i 
Oii-Ki 624'^2 6'347r3 



(3) 



where the diagonal elements are such that each row sums to zero and the 6 unknown parameters 
are 9 = (^i2i ^13, ^14, ^23> ^24) ^34)- A simple calculation shows that the complete log-likelihood (O 
is maximized for 

_ N{a,b) + N{b,a) 

7r,r(a) + 7r„T(6) ' " < 

The problem is, however, that the DNA sequences are only observed in the leaves, whereas 
information on substitution events (time and type) and edge lengths is missing. For two sequences 
we only observe the beginning state x{0) and the final state x{T) of the process. We have a missing 
data problem in the sense that we only have access to part of the data. 

The Expectation Maximization (EM) algorithm |2] is a broadly applicable approach for missing 
data problems. The algorithm is an iterative procedure with two steps in each iteration, the 
Expectation Step (E-step) and the Maximization Step (M-step). The algorithm approaches the 
problem of solving the actually observed log-likelihood indirectly by proceeding iteratively in terms 
of the complete log-likelihood. As the complete log-likelihood is unobservable, it is replaced by its 
conditional expectation given the observed data y = {x{0), x{T)}. In the E-step of the {k + l)'th 
iteration, the function G{9; 6^) = Egt [log L{6] x) \y\ is calculated, and in the M-step a new parameter 
value 9^^^ is obtained as the value of 9 that maximizes G{9; 9^). For the GTR model ® the M-step 
becomes 

^ Eg..[jV(a,5)|x(0),x(r)] +Eg.[Ar(b,Q)|x(0),rr(T)] 
'''^^ 7rfeE,.[r(a)|x(0),a;(T)]+^,Ee.[r(6)|x(0),x(T)]' " 

The algorithm converges to a local maximum likelihood estimate of the observed data. 

Since the complete log-likelihood is linear in the time spent in a state and the number of 
transitions between states, and expectation is a linear operator, all we need in the E-step is to 
calculate conditional expectations of these two quantities given the observed data. The conditional 
expectations are provided in the following theorem. 

Theorem 1 (Conditional expectations Consider a continuous time Markov process 

{x{t) : < t < T} on a finite state space with rate matrix Q. Denote the transition matrix 
P{t) = exp{Qt). We have the following conditional expectations: 

• Time spent in state a 

E[r(a)|x(0) = i,x{T) = j] = r Pia{t)Paj{T - t)dt/P,,{T). 

Jo 
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• Number of transitions between states a and b 

E[N{a,b)\xiO) = i,xiT) = j] = Qia,b) P^mjiT - t)dt/Pij{T). 

Jo 

The transition probability matrix P{t) = exp{Qt) with entries Pah{T) = P{X{T) = b\X{0) = 
a) is calculated using an eigenvalue decomposition of Q. Let U be the orthogonal matrix with 
eigenvalues as columns and Dx the diagonal matrix of corresponding eigenvectors such that Q = 
UDxU-^ and P(T) = e^'^ = Ue^'^^lJ-^. In the most general case some of the eigenvalues and 
eigenvectors are complex. Conditional expectations are now found from 

r Pab{t)Pcd{T - t)dt = V UaiUr^^ V UcjUr/Jij, where Jij = e^^^ C e'^^^-^^Ut. 
Jo i ^ Jo 

Even when the eigenvalues are complex, the Jij integrals are easy to evaluate. 

In the case of multiple sites that are identically distributed and evolve independently we can 
summarize the data in a frequency table i^iijj) that summarizes the number of sites with x(0) = i 
and x{T) = j. In this case the complete log-likelihood conditional on the observed data becomes 

G(0;0'=) = J]^z/(i,j)E,.[logL(0;x)|x(O) =i,x(r) = j]. 

i j 

The linearity in the time spent in a state and the number of jumps between states is maintained and 
the EM-algorithm remains simple. Indeed, the conditional expectations are given by Theorem^ 
and the M-step for the GTR model becomes 

_ E^EJ'^i^,me4Nia,b)\xiO)=i,x{T)=j]+Eg,[Nib,a)\xiO)=i,x{T)=j]) 
E^EJ''ihj)i^bEg,[T{a)\x{0)=i,xiT)=j]+7^aEg,[T{b)\x{0)=i,x{T)=j]) • 

Note that because we assume P{t) = exp{Qt), t and Q are confounded. Indeed, Qt = 
(2Q)/(i/2), which means that twice the rate at half the time has the same results. To avoid 
confounding, the rate matrix is calibrated such that time corresponds to expected changes per site. 
This means that ^i^j^iTTiQij = 1. 

2.2 Multiple sequences 

Now consider the case of multiple sequences related by an unrooted phylogenetic tree with n leaves 
(terminal nodes), n terminal edges, n — 2 internal nodes and n — 3 internal edges. The single site 
complete log-likelihood becomes 

2n-3 

logL(0;x)= ( j;r*(a)Q*(a,a) + 2] j;iV'(a,6)logQ^(a,6) 

i=l aGE aeE bj^a 

where T^{a) is the total time spent in state a on edge i and N^{a, b) is the number of transitions from 
a to 6 on edge i. Letting y = {y^, . . . ,y^) be the observed data at the leaves and letting a{i),d{i) 
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be the ancestral and descendant values at the two ends of the edge with descendant node i we get 
from the Markov property 



2n-3 

i=l a{i),d{i) agS aGS bj^a 

Here Pgk {a{i) , d{i)\y) is the probability of observing the ancestral value a(i) and descendant value 
d{i) at the edge with descendant node i given the data y. These probabilities are calculated using 
Felsenstein's peeling algorithm [Hj. In the E-step we therefore need to calculate conditional mean 
values on each edge. Conditioning on a{i) and d{i), the conditional mean values are determined by 
Theorem ^ 

In this paper we consider the case where the rate matrix is the same on all edges, but the 
length varies between edges. For example, the rate matrix Q could be the GTR rate matrix (jSJ 
calibrated. The rate matrix on each edge i is then given hy = WiQ, i = 1, . . . , 2n — 3, where 
Wi is the length of the edge with descendant node i. The M-step is slightly more complicated for 
multiple sequences compared to pairwise sequences, but can be carried out as described in JUj. 



3 Generalized neighbor-joining 

We will consider binary trees, meaning that all terminal nodes are of degree one and all internal 
nodes are of degree three. Given a tree, its edge lengths are said to be additive if the distance 
between any pair of leaves is the sum of the lengths of the edges on the path connecting them. 
Homogeneous finite-state continuous time Markov models satisfy 

Pik{tl)Pkj{t2) = Pi.ih + t2). (4) 

k 

Define the maximum likelihood distance |Sj between a pair of leaves by 

N 

d{i,j) = argmaxjj JJPj^.(^) 

s=l 

where = (y*(l), . . . ,y^{N)) is the DNA sequence of length N observed at leaf i. Suppose the 
leaves i and j are connected by node k. Using @ and consistency of maximum likelihood we get 
d{i,j) ~ ii + t2- Extending this argument to a general phylogenetic tree we see that maximum 
likelihood distances based on continuous time Markov chains between leaf sequences should be close 
to additive if there is enough data to obtain reliable estimates of the pairwise distances. 



3.1 Saitou-Nei neighbor-joining method 

Given a binary tree T with additive edge lengths, we can reconstruct it from the pairwise distances 
d{i,j) of its leaves {i, j} using the neighbor-joining method of Saitou and Nei (1987). Firstly, pick 
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a cherry {i, j} in the tree, i.e. leaves that have the same parent node I. Secondly, remove the leaves 
i and j from the set of leaves and add I, defining its distance to any other leaf k by 

dil,k) = ^{d{i,k) + d{j,k)-d{i,j)) 

and edge lengths Wi and Wj of edges with descendant nodes i and j, respectively, by 



d{i,j) + 



1 



n - 2 



id{i,k)-d{j,k)) 



d{i,j) - Wi 



(5) 
(6) 



Repeat the procedure until the number of leaves n is 3. The main problem is picking the cherry, 
and a solution was suggested in Saitou and Nei JJj and modified by Studier and Keppler |18j . 

Theorem 2 (Cherry picking criterion jl7^ 118] ). Let d be an additive tree metric and define 
the n X n-matrix B with entries 

Bii, j) = (n - 2)d(i, j) - E ^) - E ^) • 



Then the pair of leaves {i* ,j*} that minimizes B is a cherry in the tree T. 



3.2 The generalized neighbor-joining method 

It is natural to ask if we can generalize the neighbor-joining method based on pairwise distances 
to distances based on subtrees. Here, a subtree is a set of leaves and a subtree distance is the 
sum of the lengths of the edges on the path connecting the subtree leaves. Distances based on 
subtrees are expected to be more accurately determined than pairwise distances because of the 
following reasons: (1) they are calculated from more data, and (2)we include more conditions 
to the calculation. (When we calculate pairwise distances via MLEs, we assume that each path 
containing a pair is independent, while they are not independent because each path share some 
branches with the other.) Pachter and Speyer show that binary trees with additive edge lengths 
are indeed determined from distances based on subtrees. Let Q be set of leaves in the tree T and 
let (^) be the set of all m-subsets of ^l. Then we denote D{R) for R G (^) the subtree distance of 
the subtree containing R € (^) . 

Theorem 3 (Subtree distance condition |l"3ij ). LetV be binary tree with additive edge lengths. 
Let m, 2 < m < {n + l)/2, be the size of each subtree. The tree T is uniquely determined from the 
set {D{R): ReQ}. 

Pachter and Speyer do not describe how to reconstruct the tree and calculate the edge lengths 
from the subtree distances. The tree reconstruction can be found in and is a two-stage proce- 
dure. 



7 



Theorem 4 (Equivalent topologies Suppose that {i, j} is a pair of leaves in the set of 

leaves with n = \Q\ and suppose that 2 < m < n — 2. Define the pairwise distances 



dii,j)= Di{i,j,A}). 



(7) 



m — 2 J 



The tree T based on the additive tree metric d has the same topology as the tree T based on the 
additive tree metric D. 

We can thus find the topology of the tree T by using the cherry picking criterion in Theorem [21 
on the pairwise distances d{i,j). Using equations © and © we can also find the edge lengths Wi 
for any edge Ci in the tree T. In the map between the edge lengths in F and the edge lengths 
in r is also described. The map between the edge lengths in the two metrics is linear and one-to-one 
and given as follows. 

Theorem 5 (Map between edge lengths Let T, F, m and n as in Theorem^ and let 

Lj{e) denote the set of leaves in the component ofT — e (or equivalently T — e) that contains leaf j 
after removing an edge e. 



1. Let e,- 



n + 1, . . . ,2n — 3, denote the internal edges of T with lengths Wi and let Ci be the 



corresponding internal edges of T with lengths Wi . Then 

2wi 



Wi 



f\Laiei)\-2\ 
y m-2 



n + 1, . . . ,2n — 3, 



where a and c are the leaves on opposite sides of the edge ej . 
2. Denote the terminal edges ofT 6?/ ei, . . . , e„ with corresponding edges ei, 



,en in r. Let 



2n-3 

E 

j=n+l 



n - 
m 



\L,{ej)\-2 
m — 2 



Wi 



Then 







/ 2wi 


-Ci 








1- 








, where A 


\ Wn J 




\ 2t()„ 









1 



m 



J 



(m- l)(n-2) 

Here, I is the identity matrix and J is the matrix consisting entirely of Is. 

The running time for computing the subtree distances is 0{Ln"^) where L is the length of 
the alignment and the computation of the distance matrix d is 0{n"^) (both steps are trivially 
parallelizable) . The subsequent neighbor-joining is 0{n^) and edge weight reconstruction is 0(?i^). 
It is interesting to note that for fixed L the running time of the algorithm is O(n^) for both m = 2 
and m = 3. 
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4 The EMGNJ algorithm 



Suppose we wish to estimate the GTR model. The EMGNJ algorithm consists of two main parts, 
reconstructing a tree via the GNJ method and improving GTR rates via the EM algorithm. We 
iterate these steps until it converges. 

Algorithm 6 (The EMGNJ algorithm). Suppose we have n DNA sequences and an integer 
2 < m < n - 2. 

Input.- n DNA sequences and an integer 2 < m < n — 2. 
Output.' The GTR rates and a phylogenetic tree. 

1. Estimate stationary distribution from empirical frequencies. 

2. Reconstruct tree using M JOIN under the Jukes-Cantor (JC69) model. 

3. Estimate GTR substitution rates and edge lengths from current tree. 

4. Reconstruct tree using MJOIN and current GTR rates. 

5. If likelihood is not improved return current tree and GTR rates; otherwise go to 3. 

In other words, we provide starting values of the algorithm in Step ^ and Step [21 and iterate 
Step 01 and Step until convergence. 

5 Computation 

We implemented subroutines of the EMGNJ algorithm, StepOand Stepjllwith m = 4 under the JC 
model. We plan that a full implementation of the algorithm will be released soon. We applied our 
implementation to find the phylogenetic tree for 21 S-locus receptor kinase (SRK) sequences from 
jl4j involved in the self/nonself discriminating self-incompatibility system of the mustard family 
described in P^ . 

We sampled 10, 000 trees from a Markov chain with stationary distribution proportional to the 
likelihood function by means of a Markov chain Monte Carlo (MCMC) algorithm implemented 
in PHYBAYES 0. We then compared the tree topology of each tree generated by this MCMC 
method with that of the reconstructed trees via Step |21 and Step-in the EMGNJ method, Saitou- 
Nei NJ method, fastDNAml, DNAml from PHYLIP package by 0, and TrExML ^ under their 
respective default settings with the JC69 model. We used treedist from PHYLIP to compare two 
tree topologies. If the symmetric difference A between two topologies is 0, then the two topologies 
are identical. Larger A's are reflective of a larger distance between the two compared topologies. 
Table ^ summarizes the distance between a reconstructed tree and the MCMC samples from the 
normalized likelihood function. For example, the first two elements in the third row of Tablenmean 
that 171 out of the 10, 000 MCMC sampled trees are at a symmetric difference of 4 (A = 4) from 
the tree reconstructed via Saitou-Nei NJ method (with pairwise distance). DNAml was used in 
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Table 1: Symmetric difference (A) between 10,000 trees sampled from the likelihood function via 
MCMC and the trees reconstructed by 5 methods. sub-EMGNJ means the implementation of 
subroutines, Step 01 and Step 01 



A 


sub-EMGNJ 


Saitou-Nei NJ 


fastDNAml 


DNAml(A) 


DNAml(B) 


TrExML 














2 


3608 





2 


77 








1 


471 





4 


3616 


171 


6 


3619 


5614 





6 


680 


5687 


5 


463 


294 


5 


8 


5615 


4134 


3987 


5636 


13 


71 


10 


12 


8 


5720 


269 





3634 


12 








272 


10 





652 


14 








10 








5631 


16 
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two ways: DNAml(A) is a basic search with no global rearrangements, whereas DNAml(B) applies 
a broader search with global rearrangements and 100 jumbled inputs. The fruits of the broader 
search are reflected by the accumulation of MCMC sampled trees over small A values from the 
DNAml(B) tree. Note that the tree topology for the reconstructed tree via Step 01 and Step jll in 
the EMGNJ method and one via the GNJ method (MJOIN have the same tree topology. 

6 Conclusion 

We expect to improve the results using a full implementation of the EMGNJ method for two main 
reasons: (1) Even though currently we have an implementation with subroutines Step |21 and Step 
01 the result is improved compared to Saitou-Nei NJ method and to fastDNAml as in Tabled (2) 
Iterating Step O and Step jH should improve the GTR rates. 

We want to investigate the number of iterations until the output from the EMGNJ method 
converges. It is also of interest to study the results with different values of m. 

7 Discussion 

A potential problem is that the EM algorithm often get stuck in local maxima. We might be able to 
avoid this problem using Moore Rejection Sampling JH]- We sample data via the sampler for each 
subtree and we reconstruct trees with a set of samples under the JC69 model. Then we estimate 
the GTR rates and trees via the EMGNJ method. At the end, we compare the likelihood values of 
these trees and we take the tree with the biggest likelihood value. One notices that the process to 
reconstruct trees via the EMGNJ method from samples can be parallelizable. 
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